skip to main content


Search for: All records

Creators/Authors contains: "Puiu, Daniela"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.

     
    more » « less
  2. Abstract

    CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available athttp://ccb.jhu.edu/chess.

     
    more » « less
  3. Abstract

    The genusQuercus, which emerged ∼55 million years ago during globally warm temperatures, diversified into ∼450 extant species. We present a high-quality de novo genome assembly of a California endemic oak,Quercus lobata, revealing features consistent with oak evolutionary success. Effective population size remained large throughout history despite declining since early Miocene. Analysis of 39,373 mapped protein-coding genes outlined copious duplications consistent with genetic and phenotypic diversity, both by retention of genes created during the ancient γ whole genome hexaploid duplication event and by tandem duplication within families, including numerous resistance genes and a very large block of duplicated DUF247 genes, which have been found to be associated with self-incompatibility in grasses. An additional surprising finding is that subcontext-specific patterns of DNA methylation associated with transposable elements reveal broadly-distributed heterochromatin in intergenic regions, similar to grasses. Collectively, these features promote genetic and phenotypic variation that would facilitate adaptability to changing environments.

     
    more » « less
  4. Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of non-gap sequence anchored to chromosomes, which is 1.2 Gbps more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2,000 genes that were previously unplaced. We also discovered more than 5,700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus. 
    more » « less
  5. Understanding the genomic and environmental basis of cold adaptation is key to understand how plants survive and adapt to different environmental conditions across their natural range. Univariate and multivariate genome-wide association (GWAS) and genotype-environment association (GEA) analyses were used to test associations among genome-wide SNPs obtained from whole-genome resequencing, measures of growth, phenology, emergence, cold hardiness, and range-wide environmental variation in coastal Douglas-fir (Pseudotsuga menziesii). Results suggest a complex genomic architecture of cold adaptation, in which traits are either highly polygenic or controlled by both large and small effect genes. Newly discovered associations for cold adaptation in Douglas-fir included 130 genes involved in many important biological functions such as primary and secondary metabolism, growth and reproductive development, transcription regulation, stress and signaling, and DNA processes. These genes were related to growth, phenology and cold hardiness and strongly depend on variation in environmental variables such degree days below 0c, precipitation, elevation and distance from the coast. This study is a step forward in our understanding of the complex interconnection between environment and genomics and their role in cold-associated trait variation in boreal tree species, providing a baseline for the species’ predictions under climate change. 
    more » « less
  6. We sequenced the genome of the North American groundhog, Marmota monax , also known as the woodchuck. Our sequencing strategy included a combination of short, high-quality Illumina reads plus long reads generated by both Pacific Biosciences and Oxford Nanopore instruments. Assembly of the combined data produced a genome of 2.74 Gbp in total length, with an N50 contig size of 1,094,236 bp. To annotate the genome, we mapped the genes from another M. monax genome and from the closely related Alpine marmot, Marmota marmota , onto our assembly, resulting in 20,559 annotated protein-coding genes and 28,135 transcripts. The genome assembly and annotation are available in GenBank under BioProject PRJNA587092 . 
    more » « less
  7. Abstract The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society 1,2 . However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals 3,4 . Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome 5 . To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity 6 . Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements. 
    more » « less
  8. Abstract Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals 1 . These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample. 
    more » « less
    Free, publicly-accessible full text available May 11, 2024
  9. Abstract The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications 1,2 . Although the resolution of these regions in the first complete assembly of a human genome—the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13)—provided a model of their homology 3 , it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologous sequences. Utilizing an all-to-all comparison of the human pangenome from the Human Pangenome Reference Consortium 4 (HPRC), we find that contigs from all of the SAACs form a community. A variation graph 5 constructed from centromere-spanning acrocentric contigs indicates the presence of regions in which most contigs appear nearly identical between heterologous acrocentric chromosomes in T2T-CHM13. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the pseudo-homologous regions than in the corresponding short and long arms, indicating higher rates of recombination 6,7 . The pseudo-homologous regions include sequences that have previously been shown to lie at the breakpoint of Robertsonian translocations 8 , and their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14 and 21. The ubiquity of signals of recombination between heterologous acrocentric chromosomes seen in the HPRC draft pangenome suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed from cytogenetic studies 50 years ago 9 . 
    more » « less
    Free, publicly-accessible full text available May 11, 2024
  10. SUMMARY

    Drought is a major limitation for survival and growth in plants. With more frequent and severe drought episodes occurring due to climate change, it is imperative to understand the genomic and physiological basis of drought tolerance to be able to predict how species will respond in the future. In this study, univariate and multitrait multivariate genome‐wide association study methods were used to identify candidate genes in two iconic and ecosystem‐dominating species of the western USA, coast redwood and giant sequoia, using 10 drought‐related physiological and anatomical traits and genome‐wide sequence‐capture single nucleotide polymorphisms. Population‐level phenotypic variation was found in carbon isotope discrimination, osmotic pressure at full turgor, xylem hydraulic diameter, and total area of transporting fibers in both species. Our study identified new 78 new marker × trait associations in coast redwood and six in giant sequoia, with genes involved in a range of metabolic, stress, and signaling pathways, among other functions. This study contributes to a better understanding of the genomic basis of drought tolerance in long‐generation conifers and helps guide current and future conservation efforts in the species.

     
    more » « less